Introduction to Web Scraping and Data Management for Social Scientists

Session 3: Scraping Static Web Pages

Johannes B. Gruber

2023-07-26

Introduction

The Plan for Today

In this session, we trap some docile data that wants to be found. We will:

  • Go over some parsing examples:
    • Wikipedia: World Happiness Report
  • Discuss some examples of good approaches to data wrangling
  • Go into a bit more detail on requesting raw data

Original Image Source: prowebscraper.com

Joe Caione via unsplash.com

Example: World Happiness Report

Use your Browser to Scout

Use your Browser’s Inspect tool

Note: The Inspect tool might not be available in all browsers; use a Chromium-based browser or Firefox.

Use rvest to scrape

library(rvest)
library(tidyverse)

# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/wiki/World_Happiness_Report")

# 2. Parse
happy_table <- html |> 
  html_elements(".wikitable") |> # select the right element
  html_table() |>                # special function for tables
  pluck(3)                       # select the third table

# 3. No wrangling necessary
happy_table
# A tibble: 153 × 9
   `Overall rank` `Country or region` Score `GDP per capita` `Social support`
            <int> <chr>               <dbl>            <dbl>            <dbl>
 1              1 Finland              7.81             1.28             1.5 
 2              2 Denmark              7.65             1.33             1.50
 3              3 Switzerland          7.56             1.39             1.47
 4              4 Iceland              7.50             1.33             1.55
 5              5 Norway               7.49             1.42             1.50
 6              6 Netherlands          7.45             1.34             1.46
 7              7 Sweden               7.35             1.32             1.43
 8              8 New Zealand          7.3              1.24             1.49
 9              9 Austria              7.29             1.32             1.44
10             10 Luxembourg           7.24             1.54             1.39
# ℹ 143 more rows
# ℹ 4 more variables: `Healthy life expectancy` <dbl>,
#   `Freedom to make life choices` <dbl>, Generosity <dbl>,
#   `Perceptions of corruption` <dbl>
## Plot the relationship between wealth and life expectancy
ggplot(happy_table, aes(x = `GDP per capita`, y = `Healthy life expectancy`)) + 
  geom_point() + 
  geom_smooth(method = 'lm')

Exercises 1

  1. Get the table with 2023 opinion polling for the next United Kingdom general election from https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_United_Kingdom_general_election
  2. Wrangle and plot the opinion poll data

Example: UK prime ministers on Wikipedia

Use your Browser to Scout

Use rvest to scrape

# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/w/index.php?title=List_of_prime_ministers_of_the_United_Kingdom&oldid=1166167337") # I'm using an older version of the page since someone recently changed it

# 2. Parse
pm_table <- html |> 
  html_element(".wikitable:contains('List of prime ministers')") |>
  html_table() |> 
  as_tibble(.name_repair = "unique") |> 
  filter(!duplicated(`Prime ministerOffice(Lifespan)`))

# 3. Inspect the result
pm_table
# A tibble: 75 × 11
   Portrait...1 Portrait...2 Prime ministerOffice(Lifespa…¹ `Term of office...4`
   <chr>        <chr>        <chr>                          <chr>               
 1 "Portrait"   "Portrait"   Prime ministerOffice(Lifespan) start               
 2 "​"           ""           Robert Walpole[27]MP for King… 3 April1721         
 3 "​"           ""           Spencer Compton[28]1st Earl o… 16 February1742     
 4 "​"           ""           Henry Pelham[29]MP for Sussex… 27 August1743       
 5 "​"           ""           Thomas Pelham-Holles[30]1st D… 16 March1754        
 6 "​"           ""           William Cavendish[31]4th Duke… 16 November1756     
 7 "​"           ""           Thomas Pelham-Holles[32]1st D… 29 June1757         
 8 ""           ""           John Stuart[33]3rd Earl of Bu… 26 May1762          
 9 ""           ""           George Grenville[34]MP for Bu… 16 April1763        
10 ""           ""           Charles Watson-Wentworth[35]2… 13 July1765         
# ℹ 65 more rows
# ℹ abbreviated name: ¹​`Prime ministerOffice(Lifespan)`
# ℹ 7 more variables: `Term of office...5` <chr>, `Term of office...6` <chr>,
#   `Mandate[a]` <chr>, `Ministerial offices held as prime minister` <chr>,
#   Party <chr>, Government <chr>, MonarchReign <chr>
Looking at the raw HTML of one table cell, we can see that each prime minister’s name and link sit inside a <b> tag:

<td rowspan="4">
  <span class="anchor" id="18th_century"></span>
   <b>
     <a href="/wiki/Robert_Walpole" title="Robert Walpole">Robert Walpole</a>
   </b>
   <sup id="cite_ref-FOOTNOTEEccleshallWalker20021,_5EnglefieldSeatonWhite19951–5PrydeGreenwayPorterRoy199645–46_28-0" class="reference">
     <a href="#cite_note-FOOTNOTEEccleshallWalker20021,_5EnglefieldSeatonWhite19951–5PrydeGreenwayPorterRoy199645–46-28">[27]</a>
   </sup>
   <br>
   <span style="font-size:85%;">MP for <a href="/wiki/King%27s_Lynn_(UK_Parliament_constituency)" title="King's Lynn (UK Parliament constituency)">King's Lynn</a>
   <br>(1676–1745)
  </span>
</td>
links <- html |> 
  html_elements(".wikitable:contains('List of prime ministers') b a") |>
  html_attr("href")
title <- html |> 
  html_elements(".wikitable:contains('List of prime ministers') b a") |>
  html_text()
tibble(name = title, link = links)
# A tibble: 90 × 2
   name                 link                                             
   <chr>                <chr>                                            
 1 Robert Walpole       /wiki/Robert_Walpole                             
 2 George I             /wiki/George_I_of_Great_Britain                  
 3 George II            /wiki/George_II_of_Great_Britain                 
 4 Spencer Compton      /wiki/Spencer_Compton,_1st_Earl_of_Wilmington    
 5 Henry Pelham         /wiki/Henry_Pelham                               
 6 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
 7 William Cavendish    /wiki/William_Cavendish,_4th_Duke_of_Devonshire  
 8 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
 9 George III           /wiki/George_III                                 
10 John Stuart          /wiki/John_Stuart,_3rd_Earl_of_Bute              
# ℹ 80 more rows

Note: these are relative links that need to be combined with https://en.wikipedia.org to work

Exercises 2

  1. For extracting text, rvest has two functions: html_text and html_text2. Explain the difference. You can test your explanation with the example html below.
html <- "<p>This is some text
         some more text</p><p>A new paragraph!</p>
         <p>Quick Question, is web scraping:

         a) fun
         b) tedious
         c) I'm not sure yet!</p>" |> 
  read_html()
  2. How could you convert the links object so that it contains actual URLs?
  3. How could you add the links we extracted above to the pm_table to keep everything together?
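For the first exercise, a minimal way to run the comparison (the explanation itself is left to you) could look like this; the snippet is a shortened, made-up variant of the example html above:

```r
library(rvest)

# A small snippet with a line break inside the first paragraph
html <- "<p>This is some text
         some more text</p><p>A new paragraph!</p>" |>
  read_html()

# Extract the same element with both functions and compare the results
html |> html_element("body") |> html_text()
html |> html_element("body") |> html_text2()
```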

Example: IC2S2 2023

What do we want

  • General goal in the course: we want to build a database of conference attendance and link this to researchers
  • So for each website:
    • Speakers
    • (Co-)authors
    • Paper/talk titles
    • Panel (to see who was in the same ones)
  • In the next session we will also collect some researcher data

Let’s scrape! Exercise

I started the code below; now it’s your turn to finish it:

# 1. Request & collect raw html
html <- read_html("https://www.ic2s2.org/program.html")

sessions <- html |> 
  html_elements(".nav_list")

# 2. Parse
talks <- sessions |> 
  html_elements("li")

talks_titles <- talks |> 
  html_elements("") |> 
  html_text()

talks_speaker <- talks |> 
  html_elements("") |> 
  html_text()

talks_authors <- talks |> 
  html_elements("") |> 
  html_text()

Let’s scrape! Solution

# 1. Request & collect raw html
html <- read_html("https://www.ic2s2.org/program.html")

sessions <- html |> 
  html_elements(".nav_list")

# 2. Parse
talks <- sessions |> 
  html_elements("li")

talks_titles <- talks |> 
  html_elements("b") |> 
  html_text()

talks_speaker <- talks |> 
  html_elements("u") |> 
  html_text()

talks_authors <- talks |> 
  html_elements("i") |> 
  html_text()

head(talks_titles)
[1] "9:45 - The dynamics of writing style in scientific collaborations [705]"                                                                        
[2] "9:51 - Performative Capitalism in Silicon Valley: Examining Performative and Substantive Political Behaviors of U.S. Technology Companies [581]"
[3] "9:57 - Spillover of Antisocial Behavior from Fringe Platforms: The Unintended Consequences of Community Banning [537]"                          
[4] "10:03 - The Microtargeting Manipulation Machine  [383]"                                                                                         
[5] "10:09 - Sentence-Level Explanations of Word Embeddings Associations [518]"                                                                      
[6] "10:15 - Mapping the Effect of Altruistic Punishment on Cooperation [730]"                                                                       
head(talks_speaker)
[1] "Jonas L Juul"     "Chloe Ahn"        "Giuseppe Russo"   "Almog Simchon"   
[5] "Arianna Pera"     "Mohammed Alsobay"
head(talks_authors)
[1] "Jonas L Juul, Jon Kleinberg"                                         
[2] "Chloe Ahn, Xinyi Wang"                                               
[3] "Giuseppe Russo, luca verginer, Manoel Horta Ribeiro, Giona Casiraghi"
[4] "Almog Simchon, Adam Sutton, Matthew Edwards, Stephan Lewandowsky"    
[5] "Arianna Pera, Manuel Vimercati, Matteo Palmonari"                    
[6] "Mohammed Alsobay, Abdullah Almaatouq, David G. Rand, Duncan J. Watts"

We can now try to wrangle the data into a suitable format:

# 3. Wrangle
ic2s2_program <- tibble(talks_titles, talks_speaker, talks_authors)
Error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 606: Existing data.
• Size 593: Column at position 2.
ℹ Only values of size one are recycled.
length(talks_titles)
[1] 606
length(talks_speaker)
[1] 593
length(talks_authors)
[1] 606
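The mismatch arises because html_elements() flattens its results across all talks: if a talk has no <u> tag, it simply contributes nothing, and the title and speaker vectors silently fall out of alignment. A minimal illustration with made-up HTML:

```r
library(rvest)

# Two talks: the second one has no <u> speaker tag
mini <- read_html("<ul>
  <li><b>Talk A</b> <u>Alice</u></li>
  <li><b>Talk B</b></li>
</ul>")

length(html_elements(mini, "li b")) # 2 titles
length(html_elements(mini, "li u")) # 1 speaker; nothing marks the gap
```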

Let’s scrape (a little better)!

If several elements are children of an element that represents our “case”/data row, we can loop over the identified parent elements. For this, let’s write a function that parses one talk:

talks <- sessions |> 
  html_elements("li")

parse_talks <- function(x) {
  title <- x |> 
    html_elements("b") |> 
    html_text()
  
  speaker <- x |> 
    html_elements("u") |> 
    html_text()
  
  author <- x |> 
    html_elements("i") |> 
    html_text()
  tibble(title, speaker, author)
}

parse_talks(talks[[1]])
# A tibble: 1 × 3
  title                                                           speaker author
  <chr>                                                           <chr>   <chr> 
1 9:45 - The dynamics of writing style in scientific collaborati… Jonas … Jonas…

Now we can use the function to loop over all talks:

talks_data <- map(talks, parse_talks) |> 
  bind_rows() |> 
  separate(col = title, into = c("time", "title"), sep = " - ")
talks_data
# A tibble: 593 × 4
   time  title                                                    speaker author
   <chr> <chr>                                                    <chr>   <chr> 
 1 9:45  The dynamics of writing style in scientific collaborati… Jonas … Jonas…
 2 9:51  Performative Capitalism in Silicon Valley: Examining Pe… Chloe … Chloe…
 3 9:57  Spillover of Antisocial Behavior from Fringe Platforms:… Giusep… Giuse…
 4 10:03 The Microtargeting Manipulation Machine  [383]           Almog … Almog…
 5 10:09 Sentence-Level Explanations of Word Embeddings Associat… Ariann… Arian…
 6 10:15 Mapping the Effect of Altruistic Punishment on Cooperat… Mohamm… Moham…
 7 10:21 Social Contagion in Science [725]                        Satyak… Sara …
 8 10:27 Big Moves: Tactics for early career success [710]        Isabel… Isabe…
 9 9:45  The motherhood penalty in mobility [418]                 Silvia… Silvi…
10 9:51  Momentary Symbols: Tracing the Visual Expressions of Pr… Emil B… Emil …
# ℹ 583 more rows

But What about Sessions?

titles <- html |> 
  html_elements(".wrapper.style2") |> 
  html_elements(":not(header)>h2")
length(titles)
[1] 48
sessions <- html |> 
  html_elements(".wrapper.style2") |> 
  html_elements("ul")
length(sessions)
[1] 48
talks_data <- map2(titles, sessions, function(x, y) {
  chair <- x |> 
    html_element("i") |> 
    html_text()
    
  session_title <- x |> 
    html_text() |> 
    str_remove(chair)
  
  y |> 
    html_elements("li") |> 
    map(parse_talks) |> 
    bind_rows() |> 
    mutate(session = session_title, chair = chair, .before = 1)
  
}) |> 
  bind_rows()
talks_data
# A tibble: 284 × 5
   session                                            chair title speaker author
   <chr>                                              <chr> <chr> <chr>   <chr> 
 1 "Session 1A: \"Polarization I\" - July 18, 11:00,… Chai… 11:0… Marile… Maril…
 2 "Session 1A: \"Polarization I\" - July 18, 11:00,… Chai… 11:1… Tianch… Tianc…
 3 "Session 1A: \"Polarization I\" - July 18, 11:00,… Chai… 11:3… Nikola… Nikol…
 4 "Session 1A: \"Polarization I\" - July 18, 11:00,… Chai… 11:4… Gianma… Corra…
 5 "Session 1A: \"Polarization I\" - July 18, 11:00,… Chai… 12:0… Homa H… Homa …
 6 "Session 1A: \"Polarization I\" - July 18, 11:00,… Chai… 12:1… Li Zha… Li Zh…
 7 "Session 1B: \"Social networks I\" - July 18, 11:… Chai… 11:0… Melani… Melan…
 8 "Session 1B: \"Social networks I\" - July 18, 11:… Chai… 11:1… Takahi… Takah…
 9 "Session 1B: \"Social networks I\" - July 18, 11:… Chai… 11:3… Edwin … Dinge…
10 "Session 1B: \"Social networks I\" - July 18, 11:… Chai… 11:4… Veniam… Manoe…
# ℹ 274 more rows
saveRDS(talks_data, "../data/ic2s2_2023_data.rds")

Example: 2023 ECPR General Conference

What do we want

  • General goal in the course: we want to build a database of conference attendance and link this to researchers
  • So for each website:
    • Speakers
    • (Co-)authors
    • Paper/talk titles
    • Panel (to see who was in the same ones)
  • In the next session we will also collect some researcher data

Let’s investigate the site a little


  • Interestingly, we can sort and filter the data
  • This is an indicator that the site very likely uses a JavaScript function that gets the data from somewhere
html <- readLines("https://ecpr.eu/Events/AcademicProgramme/Schedule?eventID=214")
object.size(html) |> 
  format(units = "MB")
[1] "2 Mb"

Let’s Try to get the json: why?

Why:

  • JavaScript Object Notation (json) is a way of storing complicated nested data
  • the data is put into a character string that indicates the types of objects and how they relate to each other
  • R knows how to read json strings/files and can easily process them
library(jsonlite)
json_string <- list(x = 1:10, y = list(z = 1:10, a = LETTERS[1:10])) |> 
  toJSON()
json_string
{"x":[1,2,3,4,5,6,7,8,9,10],"y":{"z":[1,2,3,4,5,6,7,8,9,10],"a":["A","B","C","D","E","F","G","H","I","J"]}} 
fromJSON(json_string)
$x
 [1]  1  2  3  4  5  6  7  8  9 10

$y
$y$z
 [1]  1  2  3  4  5  6  7  8  9 10

$y$a
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
  • essentially we seem to get pre-packaged data here :)

Getting the json string

We use a standard CSS selector to get the script element inside the scheduleGrid_Container:

html <- readLines("https://ecpr.eu/Events/AcademicProgramme/Schedule?eventID=214", encoding = "windows-1252") |>
  paste(collapse = "\n") |>
  read_html()

json_string <- html |>
  html_element("#scheduleGrid_Container script") |>
  html_text()

Don’t try to print the entire string, since it is quite large and might crash RStudio!

nchar(json_string)
[1] 1985406
substr(json_string, 1, 250)
[1] "DevExpress.utils.readyCallbacks.add((function($){$(\"#AcademicProgramme_ScheduleGrid\").dxDataGrid({\"dataSource\":{\"store\":new DevExpress.data.ArrayStore({\"data\":[{\"ID\":12761,\"EventID\":214,\"StartDate\":new Date(2023, 8, 4),\"EndDate\":new Date(2023, 8, 4),"

Note: In newer versions of R, read_html seems to have a bug that causes problems with special encodings, which is why I load the data with the base function readLines().

Parsing the json string

Let’s give it a try:

fromJSON(json_string)
Error: lexical error: invalid char in json text.
                                       DevExpress.utils.readyCallbacks
                     (right here) ------^

The problem is that this is not just the json itself, but also some Javascript to sort and render it:

substr(json_string, 1, 250)
[1] "DevExpress.utils.readyCallbacks.add((function($){$(\"#AcademicProgramme_ScheduleGrid\").dxDataGrid({\"dataSource\":{\"store\":new DevExpress.data.ArrayStore({\"data\":[{\"ID\":12761,\"EventID\":214,\"StartDate\":new Date(2023, 8, 4),\"EndDate\":new Date(2023, 8, 4),"

If you have a little experience with json, you can spot that the actual data only starts here:

data.ArrayStore({\"data\"
(right here) ---^
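To see why trimming the JavaScript around the payload works, here is the same pattern in miniature, using a made-up wrapped string rather than the real ECPR data:

```r
library(stringr)
library(jsonlite)

# Hypothetical JavaScript wrapper around a json payload
wrapped <- 'render(new ArrayStore({"data":[{"ID":1,"Title":"A talk"}]}));'

# Locate the start of the payload and its closing ]} and cut the json out.
# This simple cut assumes the first ]} really closes the data array,
# which does not hold for the real, much messier ECPR string.
start <- str_locate(wrapped, fixed('{"data"'))[1, "start"]
end   <- str_locate(wrapped, fixed(']}'))[1, "end"]
fromJSON(str_sub(wrapped, start, end))
```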

Cleaning the json string

json_clean <- json_string |>
  str_remove_all(fixed("DevExpress.utils.readyCallbacks.add((function($){$(\"#AcademicProgramme_ScheduleGrid\").dxDataGrid({\"dataSource\":{\"store\":new DevExpress.data.ArrayStore("))
fromJSON(json_clean)
Error: lexical error: invalid string in json text.
          61,"EventID":214,"StartDate":new Date(2023, 8, 4),"EndDate":
                     (right here) ------^

It looks like there is some more Javascript code in the data. We can convert the date function into proper json data with a little regular expression:

str_replace_all('"StartDate":new Date(2023, 8, 4)', "new Date\\((\\d+), (\\d+), (\\d+)\\)", "\"\\1-\\2-\\3\"")
[1] "\"StartDate\":\"2023-8-4\""

Perfect, let’s try this again:

json_clean <- json_string |>
  str_remove_all(fixed("DevExpress.utils.readyCallbacks.add((function($){$(\"#AcademicProgramme_ScheduleGrid\").dxDataGrid({\"dataSource\":{\"store\":new DevExpress.data.ArrayStore(")) |> 
  str_replace_all("new Date\\((\\d+), (\\d+), (\\d+)\\)", "\"\\1-\\2-\\3\"")
fromJSON(json_clean)
Error: parse error: trailing garbage
          03e","IsInItinerary":false}]})},"showBorders":true,"showColu
                     (right here) ------^

Blimey! More garbage :(

Cleaning the json string: second attempt

It seems there are more problems at the end of the json string, so let’s have a look:

len <- nchar(json_string)
substr(json_string, len - 1100, len) # It took a few guesses to circle in on the 1100
[1] "\\u003e\\r\\n\\u003cbr/\\u003e\\r\\n\\u003ca href=\\u0022/profile/ElisaBellè\\u0022 target=\\u0022_blank\\u0022\\u003eElisa\\u0026nbsp;Bellè\\u003c/a\\u003e\\r\\n\\u003cbr/\\u003e\",\"IsInItinerary\":false}]})},\"showBorders\":true,\"showColumnLines\":false,\"showRowLines\":true,\"paging\":{\"pageSize\":20},\"columnHidingEnabled\":true,\"columnAutoWidth\":true,\"wordWrapEnabled\":true,\"searchPanel\":{\"visible\":true,\"searchVisibleColumnsOnly\":true},\"columns\":[{\"dataField\":\"StartDate\",\"cssClass\":\"align-top\",\"visible\":false},{\"dataField\":\"EndDate\",\"cssClass\":\"align-top\",\"visible\":false},{\"dataField\":\"Activity\",\"cssClass\":\"align-top\",\"encodeHtml\":false,\"allowHiding\":false},{\"dataField\":\"Papers\",\"cssClass\":\"align-top\",\"encodeHtml\":false,\"allowHiding\":true,\"hidingPriority\":1},{\"dataField\":\"People\",\"cssClass\":\"align-top\",\"encodeHtml\":false,\"allowHiding\":true,\"hidingPriority\":0},{\"name\":\"ItineraryButton\",\"cssClass\":\"align-top\",\"caption\":\" \",\"encodeHtml\":false,\"allowHiding\":true,\"hidingPriority\":2,\"calculateCellValue\":function(data) { return scheduleGrid.formatItineraryButton(data); },\"alignment\":\"center\"}]});}).bind(this, jQuery));"

It seems Elisa\\u0026nbsp;Bellè is still part of the data, and the error message pinpoints pretty well which part is the garbage. Let’s try to remove the trailing Javascript bits.

Perfect, let’s try this again:

json_clean <- json_string |>
  str_remove_all(fixed("DevExpress.utils.readyCallbacks.add((function($){$(\"#AcademicProgramme_ScheduleGrid\").dxDataGrid({\"dataSource\":{\"store\":new DevExpress.data.ArrayStore(")) |> 
    str_remove_all(fixed(")},\"showBorders\":true,\"showColumnLines\":false,\"showRowLines\":true,\"paging\":{\"pageSize\":20},\"columnHidingEnabled\":true,\"columnAutoWidth\":true,\"wordWrapEnabled\":true,\"searchPanel\":{\"visible\":true,\"searchVisibleColumnsOnly\":true},\"columns\":[{\"dataField\":\"StartDate\",\"cssClass\":\"align-top\",\"visible\":false},{\"dataField\":\"EndDate\",\"cssClass\":\"align-top\",\"visible\":false},{\"dataField\":\"Activity\",\"cssClass\":\"align-top\",\"encodeHtml\":false,\"allowHiding\":false},{\"dataField\":\"Papers\",\"cssClass\":\"align-top\",\"encodeHtml\":false,\"allowHiding\":true,\"hidingPriority\":1},{\"dataField\":\"People\",\"cssClass\":\"align-top\",\"encodeHtml\":false,\"allowHiding\":true,\"hidingPriority\":0},{\"name\":\"ItineraryButton\",\"cssClass\":\"align-top\",\"caption\":\" \",\"encodeHtml\":false,\"allowHiding\":true,\"hidingPriority\":2,\"calculateCellValue\":function(data) { return scheduleGrid.formatItineraryButton(data); },\"alignment\":\"center\"}]});}).bind(this, jQuery));")) |>
  str_replace_all("new Date\\((\\d+), (\\d+), (\\d+)\\)", "\"\\1-\\2-\\3\"")
json_parsed <- fromJSON(json_clean)

Smashing! R converted the data successfully!

Wrangling the data: first look

Let’s have a look at the data now:

glimpse(json_parsed)
List of 1
 $ data:'data.frame':   607 obs. of  8 variables:
  ..$ ID           : int [1:607] 12761 12045 12572 12400 12085 12425 12076 12411 12099 12058 ...
  ..$ EventID      : int [1:607] 214 214 214 214 214 214 214 214 214 214 ...
  ..$ StartDate    : chr [1:607] "2023-8-4" "2023-8-4" "2023-8-4" "2023-8-4" ...
  ..$ EndDate      : chr [1:607] "2023-8-4" "2023-8-4" "2023-8-4" "2023-8-4" ...
  ..$ Activity     : chr [1:607] "<i class=\"fa-solid fa-buildings\" title=\"In person icon\"></i>\r\nExecutive Committee Meeting (For Executive "| __truncated__ "<i class=\"fa-solid fa-buildings\" title=\"In person icon\"></i>\r\n<a href=\"/Events/Event/PanelDetails/12825\"| __truncated__ "<i class=\"fa-solid fa-buildings\" title=\"In person icon\"></i>\r\n<a href=\"/Events/Event/PanelDetails/12587\"| __truncated__ "<i class=\"fa-solid fa-buildings\" title=\"In person icon\"></i>\r\n<a href=\"/Events/Event/PanelDetails/12861\"| __truncated__ ...
  ..$ Papers       : chr [1:607] "" "<a href=\"/Events/Event/PaperDetails/70630\"target=\"_blank\">Critique of Critical Policy Discourse Analysis</a"| __truncated__ "<a href=\"/Events/Event/PaperDetails/67670\"target=\"_blank\">Elites in times of epochal change. Empirical evid"| __truncated__ "<a href=\"/Events/Event/PaperDetails/67347\"target=\"_blank\">Does Political Polarization Affect Migrant Integr"| __truncated__ ...
  ..$ People       : chr [1:607] "" "Chair: \r\n<a href=\"/profile/MichaelFarrelly\" target=\"_blank\">Michael&nbsp;Farrelly</a>\r\n<br/>\r\nCo-Chai"| __truncated__ "Chair: \r\n<a href=\"/profile/UrsulaHoffmann-Lange\" target=\"_blank\">Ursula&nbsp;Hoffmann-Lange</a>\r\n<br/>\"| __truncated__ "Chair: \r\n<a href=\"/profile/MaikHerold\" target=\"_blank\">Maik&nbsp;Herold</a>\r\n<br/>\r\nCo-Chair: \r\n<a "| __truncated__ ...
  ..$ IsInItinerary: logi [1:607] FALSE FALSE FALSE FALSE FALSE FALSE ...
  • It seems that json_parsed is now a list that contains only one data.frame.
  • This data.frame seems to have a mix of clean data and HTML code
  • That means we need to do a little more parsing

Wrangling the data: first steps

As a first step, we pull the data out and save it in a new object:

ecpr_data_df <- json_parsed$data
ecpr_data_df$Papers[2]
[1] "<a href=\"/Events/Event/PaperDetails/70630\"target=\"_blank\">Critique of Critical Policy Discourse Analysis</a>\r\n<br/>\r\nAuthor: <a href=\"/profile/MichaelKranert\" target=\"_blank\">Michael&nbsp;Kranert</a>\r\n<br/>\r\n<br/>\r\n<a href=\"/Events/Event/PaperDetails/70631\"target=\"_blank\">Critique of Critical Policy Discourse Analysis</a>\r\n<br/>\r\nAuthor: <a href=\"/profile/DesGasper\" target=\"_blank\">Des&nbsp;Gasper</a>\r\n<br/>\r\n<br/>\r\n<a href=\"/Events/Event/PaperDetails/70632\"target=\"_blank\">An Editor Responds to Critique of Critical Policy Discourse Analysis</a>\r\n<br/>\r\nAuthor: <a href=\"/profile/MichaelFarrelly\" target=\"_blank\">Michael&nbsp;Farrelly</a>"

Since we are used to looking at HTML rendered in a browser rather than in its raw form, we can define a small function that writes the HTML code to a file and asks the browser to render it:

check_in_browser <- function(html) {
  tmp <- tempfile(fileext = ".html")
  writeLines(html, tmp)
  browseURL(tmp)
}
check_in_browser(ecpr_data_df$Papers[2])

Wrangling the data: Getting some data

html <- read_html(ecpr_data_df$Papers[2])
paper_title <- html |>
  html_elements("[href*='PaperDetails']") |>
  html_text2()
paper_title
[1] "Critique of Critical Policy Discourse Analysis"                      
[2] "Critique of Critical Policy Discourse Analysis"                      
[3] "An Editor Responds to Critique of Critical Policy Discourse Analysis"
authors <- html |>
  html_elements("[href*='profile']") |>
  html_text2()
authors
[1] "Michael Kranert"  "Des Gasper"       "Michael Farrelly"
author_urls <- html |>
  html_elements("[href*='profile']") |>
  html_attr("href") 
author_urls
[1] "/profile/MichaelKranert"  "/profile/DesGasper"      
[3] "/profile/MichaelFarrelly"

Wrangling the data: writing a parser

extract_papers <- function(html) {
  html <- read_html(html)

  paper_title <- html |>
    html_elements("[href*='PaperDetails']") |>
    html_text2()
  
  authors <- html |>
    html_elements("[href*='profile']") |>
    html_text2()
  
  author_urls <- html |>
    html_elements("[href*='profile']") |>
    html_attr("href") 
  
  tibble(paper_title, authors, author_urls)
}
extract_papers(ecpr_data_df$Papers[2])
# A tibble: 3 × 3
  paper_title                                                authors author_urls
  <chr>                                                      <chr>   <chr>      
1 Critique of Critical Policy Discourse Analysis             Michae… /profile/M…
2 Critique of Critical Policy Discourse Analysis             Des Ga… /profile/D…
3 An Editor Responds to Critique of Critical Policy Discour… Michae… /profile/M…

Wrangling the data: parsing the whole data

ecpr_data <- ecpr_data_df |>
  filter(Papers != "") |> # remove empty lines
  mutate(papers = map(Papers, extract_papers)) # loop over all sessions
Error in `mutate()`:
ℹ In argument: `papers = map(Papers, extract_papers)`.
Caused by error in `map()`:
ℹ In index: 2.
Caused by error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 5: Existing data.
• Size 6: Column at position 2.
ℹ Only values of size one are recycled.

Another error :( let’s investigate!

html <- ecpr_data_df |>
  filter(Papers != "") |> 
  slice(2) |> 
  pull(Papers) |> 
  read_html()

paper_title <- html |>
  html_elements("[href*='PaperDetails']") |>
  html_text2()

authors <- html |>
  html_elements("[href*='profile']") |>
  html_text2()

author_urls <- html |>
    html_elements("[href*='profile']") |>
    html_attr("href") 

paper_title
[1] "Elites in times of epochal change. Empirical evidence and theoretical insights in an intertemporal perspective"
[2] "Determinants of Cooperation Potentials in the German Bundestag"                                                
[3] "From elite consensus to asymmetric polarisation. Lessons learnt from the Hungarian case"                       
[4] "Band of Brothers: Projection of Power Centralization in Times of Crisis in Central European countries"         
[5] "The Rule of Populist Nationalists and Elite Decay in Poland and Hungary"                                       
authors
[1] "Heinrich Best"         "Ursula Hoffmann-Lange" "Gabriella Ilonszki"   
[4] "Gyorgy Lengyel"        "David Broul"           "Jan Pakulski"         
author_urls
[1] "/profile/HeinrichBest1"        "/profile/UrsulaHoffmann-Lange"
[3] "/profile/GabriellaIlonszki2"   "/profile/GyorgyLengyel"       
[5] "/profile/DavidBroul"           "/profile/JanPakulski"         

It makes sense that some papers have several authors, so let’s take another look at the HTML here:

ecpr_data_df |>
  filter(Papers != "") |> 
  slice(2) |> 
  pull(Papers) |> 
  cat()
<a href="/Events/Event/PaperDetails/67670"target="_blank">Elites in times of epochal change. Empirical evidence and theoretical insights in an intertemporal perspective</a>
<br/>
Author: <a href="/profile/HeinrichBest1" target="_blank">Heinrich&nbsp;Best</a>
<br/>
<br/>
<a href="/Events/Event/PaperDetails/67671"target="_blank">Determinants of Cooperation Potentials in the German Bundestag</a>
<br/>
Author: <a href="/profile/UrsulaHoffmann-Lange" target="_blank">Ursula&nbsp;Hoffmann-Lange</a>
<br/>
<br/>
<a href="/Events/Event/PaperDetails/67672"target="_blank">From elite consensus to asymmetric polarisation. Lessons learnt from the Hungarian case</a>
<br/>
Authors: <a href="/profile/GabriellaIlonszki2" target="_blank">Gabriella&nbsp;Ilonszki</a>, <a href="/profile/GyorgyLengyel" target="_blank">Gyorgy&nbsp;Lengyel</a>
<br/>
<br/>
<a href="/Events/Event/PaperDetails/70860"target="_blank">Band of Brothers: Projection of Power Centralization in Times of Crisis in Central European countries</a>
<br/>
Author: <a href="/profile/DavidBroul" target="_blank">David&nbsp;Broul</a>
<br/>
<br/>
<a href="/Events/Event/PaperDetails/71441"target="_blank">The Rule of Populist Nationalists and Elite Decay in Poland and Hungary</a>
<br/>
Author: <a href="/profile/JanPakulski" target="_blank">Jan&nbsp;Pakulski</a>

Annoyingly, there does not seem to be a good way to combine paper titles and authors from this HTML structure :(

Wrangling the data: building a better parser

On the other hand, html code is just text, so maybe we can manipulate the strings a little?

author_lines <- html |>
  as.character() |>
  strsplit(split = "\n") |>
  pluck(1) |>
  str_subset("^Author")
author_lines
[1] "Author: <a href=\"/profile/HeinrichBest1\" target=\"_blank\">Heinrich Best</a>\r"                                                                                    
[2] "Author: <a href=\"/profile/UrsulaHoffmann-Lange\" target=\"_blank\">Ursula Hoffmann-Lange</a>\r"                                                                     
[3] "Authors: <a href=\"/profile/GabriellaIlonszki2\" target=\"_blank\">Gabriella Ilonszki</a>, <a href=\"/profile/GyorgyLengyel\" target=\"_blank\">Gyorgy Lengyel</a>\r"
[4] "Author: <a href=\"/profile/DavidBroul\" target=\"_blank\">David Broul</a>\r"                                                                                         
[5] "Author: <a href=\"/profile/JanPakulski\" target=\"_blank\">Jan Pakulski</a>"                                                                                         

Now we have only 5 lines that mention an author!

Let’s try to parse them using a loop

map(author_lines, function(x) {
  html <- read_html(x)
  author_urls <- html |>
    html_elements("a") |>
    html_attr("href")
  
  authors <- html |>
    html_elements("a") |>
    html_text2()
  list(author_urls = author_urls, authors = authors)
})
[[1]]
[[1]]$author_urls
[1] "/profile/HeinrichBest1"

[[1]]$authors
[1] "Heinrich Best"


[[2]]
[[2]]$author_urls
[1] "/profile/UrsulaHoffmann-Lange"

[[2]]$authors
[1] "Ursula Hoffmann-Lange"


[[3]]
[[3]]$author_urls
[1] "/profile/GabriellaIlonszki2" "/profile/GyorgyLengyel"     

[[3]]$authors
[1] "Gabriella Ilonszki" "Gyorgy Lengyel"    


[[4]]
[[4]]$author_urls
[1] "/profile/DavidBroul"

[[4]]$authors
[1] "David Broul"


[[5]]
[[5]]$author_urls
[1] "/profile/JanPakulski"

[[5]]$authors
[1] "Jan Pakulski"

This gives us a list with the right number of elements, while some elements contain several authors.

Wrangling the data: building a better parser

extract_papers <- function(html) {
  html <- read_html(html)

  paper_title <- html |>
    html_elements("[href*='PaperDetails']") |>
    html_text2()

  authors <- html |>
    as.character() |>
    strsplit(split = "\n") |>
    pluck(1) |>
    str_subset("^Author")

  profiles <- map(authors, function(x) {
    # some author fields are empty; we need to check for that
    if (nchar(x) <= 10L) {
      return(list(author_urls = NA_character_, authors = NA_character_))
    } else {
      html <- read_html(x)
      author_urls <- html |>
        html_elements("a") |>
        html_attr("href")

      authors <- html |>
        html_elements("a") |>
        html_text2()
      return(list(author_urls = author_urls, authors = authors))
    }
  })

  tibble(paper_title, authors = map(profiles, "authors"), author_urls = map(profiles, "author_urls"))

}
ecpr_data_df |>
  filter(Papers != "") |> 
  slice(2) |> 
  pull(Papers) |> 
  extract_papers() |> 
  unnest(c(authors, author_urls)) # unnest puts each paper into its own row, duplicating ids where necessary
# A tibble: 6 × 3
  paper_title                                                authors author_urls
  <chr>                                                      <chr>   <chr>      
1 Elites in times of epochal change. Empirical evidence and… Heinri… /profile/H…
2 Determinants of Cooperation Potentials in the German Bund… Ursula… /profile/U…
3 From elite consensus to asymmetric polarisation. Lessons … Gabrie… /profile/G…
4 From elite consensus to asymmetric polarisation. Lessons … Gyorgy… /profile/G…
5 Band of Brothers: Projection of Power Centralization in T… David … /profile/D…
6 The Rule of Populist Nationalists and Elite Decay in Pola… Jan Pa… /profile/J…
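Since unnest() does the heavy lifting in this step, here is a minimal sketch of its behaviour with made-up data: each element of a list-column gets its own row, and the other columns are duplicated as needed.

```r
library(tidyverse)

papers <- tibble(
  paper_title = c("Paper A", "Paper B"),
  authors = list("Solo Author", c("First Author", "Second Author"))
)
papers |> 
  unnest(authors)
# three rows: Paper B is duplicated, once per author
```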

Wrangling the data: final steps

ecpr_data <- ecpr_data_df |>
  filter(Papers != "") |> 
  mutate(papers = map(Papers, extract_papers))
ecpr_data |>
  select(panel_id = ID, event_id = EventID, papers) |> 
  unnest(papers) 
# A tibble: 2,276 × 5
   panel_id event_id paper_title                             authors author_urls
      <int>    <int> <chr>                                   <list>  <list>     
 1    12045      214 Critique of Critical Policy Discourse … <chr>   <chr [1]>  
 2    12045      214 Critique of Critical Policy Discourse … <chr>   <chr [1]>  
 3    12045      214 An Editor Responds to Critique of Crit… <chr>   <chr [1]>  
 4    12572      214 Elites in times of epochal change. Emp… <chr>   <chr [1]>  
 5    12572      214 Determinants of Cooperation Potentials… <chr>   <chr [1]>  
 6    12572      214 From elite consensus to asymmetric pol… <chr>   <chr [2]>  
 7    12572      214 Band of Brothers: Projection of Power … <chr>   <chr [1]>  
 8    12572      214 The Rule of Populist Nationalists and … <chr>   <chr [1]>  
 9    12400      214 Does Political Polarization Affect Mig… <chr>   <chr [1]>  
10    12400      214 Beyond Party Identities: Mapping and A… <chr>   <chr [1]>  
# ℹ 2,266 more rows
ecpr_data_tidy <- ecpr_data |>
  select(panel_id = ID, event_id = EventID, papers) |>
  unnest(papers) |>
  unnest(c(authors, author_urls)) |>
  distinct(paper_title, authors, .keep_all = TRUE)

ecpr_data_tidy |>
  count(authors, sort = TRUE)
# A tibble: 2,897 × 2
   authors                        n
   <chr>                      <int>
 1 <NA>                           7
 2 Theodore Chadjipadelis         6
 3 André Bächtiger                5
 4 Arjan H. Schakel               5
 5 Peter H. Feindt                5
 6 Robert A. Huber                5
 7 Sebastian Ludwicki-Ziegler     5
 8 Aleh Cherp                     4
 9 Carolien van Ham               4
10 Carolina Galais                4
# ℹ 2,887 more rows
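A note on the distinct() call above: with .keep_all = TRUE it keeps all columns but retains only the first occurrence of each paper_title/authors combination, deduplicating papers that appear more than once in the programme (presumably the reason duplicates show up at all). A minimal sketch with made-up rows:

```r
library(tidyverse)

dupes <- tibble(
  paper_title = c("Paper A", "Paper A", "Paper B"),
  authors = c("Jane Doe", "Jane Doe", "John Doe"),
  panel_id = c(1, 2, 3)
)
dupes |> 
  distinct(paper_title, authors, .keep_all = TRUE)
# two rows remain; the kept "Paper A" row carries panel_id 1
```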

Exercises 3

  1. Can you find older iterations of the ECPR conference? How would you scrape these programmes?

Example: 2023 CompText (a special case)

What do we want

  • General goal in the course: we want to build a database of conference attendance and link this to researchers
  • So for each website:
    • Speakers
    • (Co-)authors
    • Paper/talk titles
    • Panel (to see who was in the same ones)
  • In the next session we will also collect some researcher data

However:

Scraping data from PDFs?

  • Data inside a PDF is actually not such an uncommon case
  • Many institutions share PDFs with tables, images and lists of data
  • But we can use some of our new pattern-finding skills to scrape data from these PDFs as well:
    • Session names seem to be in a larger font and bold
    • Paper titles are in italics
    • Authors are either bold or plain font

Let’s investigate the PDF a little

library(pdftools)
comptext <- pdf_data("https://www.comptextconference.org/wp-content/uploads/2023/05/COMPTEXT-2023-programme-May-9-2023.pdf", font_info = TRUE)
comptext[[7]]
# A tibble: 345 × 8
   width height     x     y space text           font_name       font_size
   <int>  <int> <int> <int> <lgl> <chr>          <chr>               <dbl>
 1    46     12    63    63 TRUE  Friday,        WMKLBZ+SFBX1440     14.3 
 2    30     12   116    63 TRUE  May            WMKLBZ+SFBX1440     14.3 
 3    15     12   152    63 FALSE 12             WMKLBZ+SFBX1440     14.3 
 4    19      8   318    78 TRUE  ‘The           VQHLVV+SFTI1000      9.96
 5    64      8   342    78 TRUE  Representation VQHLVV+SFTI1000      9.96
 6     8      8   410    78 TRUE  of             VQHLVV+SFTI1000      9.96
 7    66      8   422    78 TRUE  Gender-Related VQHLVV+SFTI1000      9.96
 8    25      8   493    78 TRUE  Issues         VQHLVV+SFTI1000      9.96
 9     8      8   522    78 FALSE in             VQHLVV+SFTI1000      9.96
10    49      8   318    90 TRUE  Politicians’   VQHLVV+SFTI1000      9.96
# ℹ 335 more rows

We see here that:

  • each page is an element in a list
  • each word is in one row of the table
  • it contains the font_size and font_name
  • the position of each word on the page is given with x and y coordinates

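Before chasing individual words, it can help to tabulate which fonts appear on a page and how often, since each type of information seems to come in its own font (this sketch reuses the comptext object downloaded above):

```r
# count the font/size combinations on page 7 of the programme
comptext[[7]] |> 
  count(font_name, font_size, sort = TRUE)
```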
Let’s investigate a few words we saw above:

# a panel name
comptext[[7]] |> 
  filter(str_detect(text, "Intra-Party"))
# A tibble: 1 × 8
  width height     x     y space text        font_name       font_size
  <int>  <int> <int> <int> <lgl> <chr>       <chr>               <dbl>
1    56      8   109   179 TRUE  Intra-Party ZGRRLU+SFBX1000      9.96
# a paper title
comptext[[7]] |> 
  filter(str_detect(text, "Politicization"))
# A tibble: 1 × 8
  width height     x     y space text           font_name       font_size
  <int>  <int> <int> <int> <lgl> <chr>          <chr>               <dbl>
1    57      8    87   245 TRUE  Politicization VQHLVV+SFTI1000      9.96
# a speaker name
comptext[[7]] |> 
  filter(str_detect(text, "Alberto"))
# A tibble: 2 × 8
  width height     x     y space text    font_name       font_size
  <int>  <int> <int> <int> <lgl> <chr>   <chr>               <dbl>
1    38      8    63   353 TRUE  Alberto ZGRRLU+SFBX1000      9.96
2    33      8   353   637 FALSE Alberto ZXOBMV+SFRM1000      9.96
# a co-author name
comptext[[7]] |> 
  filter(str_detect(text, "Agnieszka"))
# A tibble: 1 × 8
  width height     x     y space text      font_name       font_size
  <int>  <int> <int> <int> <lgl> <chr>     <chr>               <dbl>
1    43      8    63   424 TRUE  Agnieszka ZXOBMV+SFRM1000      9.96
# a word in the right column
comptext[[7]] |> 
  filter(str_detect(text, "Yael"))
# A tibble: 1 × 8
  width height     x     y space text  font_name       font_size
  <int>  <int> <int> <int> <lgl> <chr> <chr>               <dbl>
1    21      8   318   102 TRUE  Yael  ZGRRLU+SFBX1000      9.96
# a word on the right edge of the left column
comptext[[7]] |> 
  filter(str_detect(text, "Innova-"))
# A tibble: 1 × 8
  width height     x     y space text    font_name       font_size
  <int>  <int> <int> <int> <lgl> <chr>   <chr>               <dbl>
1    32      8   244   109 FALSE Innova- ZXOBMV+SFRM1000      9.96

It looks like we can distinguish some types of information by the font_name and by the x and y position at which a word appears.

Combining the columns

As a first step, we want the text in the left column to appear before the text in the right column. We can use the x position for this.

tidy_page <- function(page) {
  left_column <- page |> 
    filter(x < 300) |> 
    group_by(y) |> 
    summarise(text = paste(text, collapse = " "),
              font_name = list(unique(font_name)))
  
  right_column <- page |> 
    filter(x > 300) |> 
    group_by(y) |> 
    summarise(text = paste(text, collapse = " "),
              font_name = list(unique(font_name)))
  
  bind_rows(left_column, right_column)
    
}

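The two filter/summarise blocks differ only in the x cut-off, so the function can also be written without the duplication. A sketch of an essentially equivalent version (same 300-pixel split, which we assume separates the columns on every page):

```r
library(tidyverse)

tidy_page <- function(page) {
  page |> 
    # label each word by column; a word at exactly x == 300 goes right
    # here, while the version above drops it
    mutate(column = if_else(x < 300, "left", "right")) |> 
    group_by(column, y) |> 
    summarise(text = paste(text, collapse = " "),
              font_name = list(unique(font_name)),
              .groups = "drop") |> 
    # left column first, then right, each ordered top to bottom
    arrange(column, y) |> 
    select(-column)
}
```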
Finding and using patterns

comptext_data <- comptext[7:14] |> 
  # bring the text in the correct order on each page
  map(tidy_page) |> 
  bind_rows() |> 
  # paper titles use a specific font, which makes them easy to identify
  mutate(paper_title = font_name == "VQHLVV+SFTI1000") |> 
  # lines with more than one font (bold and non-bold) are meta information, let's note that
  mutate(font_name = map_chr(font_name, function(x) ifelse(length(x) > 1, "meta", x))) |> 
  # Panels start with the string panel and a large bold font
  mutate(panel_start = font_name == "ZGRRLU+SFBX1000" & str_detect(text, "Panel")) |> 
  # using cumsum, we count the TRUE values. One appears at the start of each panel
  mutate(panel = cumsum(panel_start)) |> 
  # we are not interested in the information before the first panel
  filter(panel > 0) |> 
  group_by(panel) |> 
  # we can count the first appearance of a paper title to give the paper a unique id per panel
  mutate(paper_nr = cumsum(paper_title & !lag(paper_title))) |> 
  # now we can group the data using this id to make sure the authors are matched to the right paper
  group_by(panel, paper_nr) |> 
  # summarise takes several elements and uses a function to return exactly one
  summarise(
    # we select all text where the paper_nr == 0, which is the panel title
    panel_title = paste(text[paper_nr == 0], collapse = " "),
    # now we combine the text in the rows that have a paper title
    paper_title = paste(text[paper_title], collapse = " "),
    # finally, we put the authors in a list. They are the text within a panel description that uses one of two fonts
    author = list(text[font_name %in% c("ZXOBMV+SFRM1000", "ZGRRLU+SFBX1000")]),
    .groups = "drop"
  ) |> 
  # we replace empty rows with NAs so we can fill them with the preceding text
  mutate(across(panel_title:paper_title, function(x) ifelse(x == "", NA_character_, x))) |> 
  fill(panel_title) |> 
  filter(paper_nr > 0) |> 
  unnest(author)
comptext_data
# A tibble: 259 × 5
   panel paper_nr panel_title                                 paper_title author
   <int>    <int> <chr>                                       <chr>       <chr> 
 1     1        1 Panel 1: Intra-Party Politics Location: Co… ‘The Polit… Álvar…
 2     1        1 Panel 1: Intra-Party Politics Location: Co… ‘The Polit… Mathi…
 3     1        1 Panel 1: Intra-Party Politics Location: Co… ‘The Polit… Alexa…
 4     1        2 Panel 1: Intra-Party Politics Location: Co… ‘From MP t… Alber…
 5     1        3 Panel 1: Intra-Party Politics Location: Co… ‘Ideologic… Huber…
 6     1        3 Panel 1: Intra-Party Politics Location: Co… ‘Ideologic… Agnie…
 7     1        3 Panel 1: Intra-Party Politics Location: Co… ‘Ideologic… Vikto…
 8     2        1 Panel 2: Web Scraping Location: Conference… ‘Towards a… Alona…
 9     2        1 Panel 2: Web Scraping Location: Conference… ‘Towards a… Lena …
10     2        1 Panel 2: Web Scraping Location: Conference… ‘Towards a… Will …
# ℹ 249 more rows
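The cumsum() trick used twice in the pipeline above (turning a logical “this row starts a new unit” flag into a running group id) generalises to many scraping problems, so here it is in isolation on toy data:

```r
library(tidyverse)

lines <- tibble(
  text = c("Panel 1", "Paper A", "Panel 2", "Paper B", "Paper C"),
  panel_start = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)
lines |> 
  mutate(panel = cumsum(panel_start))
# panel ids: 1, 1, 2, 2, 2
```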

Homework

You have seen some tools and tricks to scrape websites now. But your best ally in web scraping is experience! Until tomorrow noon, your task is to find a conference you could imagine attending and scrape its website to get the same information we extracted today. Even if you don’t fully succeed, document the steps you take and note down where the information can be found. If you collect raw html in R and the data is not where it should be (e.g., the html elements containing panel names do not exist), you might have discovered a more advanced site. Note that down and try another conference.

Deadline: Thursday noon

Wrap Up

Save some information about the session for reproducibility.

Show Session Info
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.11.0 
LAPACK: /usr/lib/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Amsterdam
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pdftools_3.3.3  jsonlite_1.8.7  lubridate_1.9.2 forcats_1.0.0  
 [5] stringr_1.5.0   dplyr_1.1.2     purrr_1.0.1     readr_2.1.4    
 [9] tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.2   tidyverse_2.0.0
[13] rvest_1.0.3    

loaded via a namespace (and not attached):
 [1] gtable_0.3.3      selectr_0.4-2     qpdf_1.3.2        compiler_4.3.1   
 [5] Rcpp_1.0.11       tidyselect_1.2.0  xml2_1.3.5        scales_1.2.1     
 [9] yaml_2.3.7        fastmap_1.1.1     R6_2.5.1          generics_0.1.3   
[13] curl_5.0.1        knitr_1.43        munsell_0.5.0     pillar_1.9.0     
[17] tzdb_0.4.0        rlang_1.1.1       utf8_1.2.3        stringi_1.7.12   
[21] xfun_0.39         timechange_0.2.0  cli_3.6.1         withr_2.5.0      
[25] magrittr_2.0.3    digest_0.6.33     grid_4.3.1        rstudioapi_0.15.0
[29] askpass_1.1       hms_1.1.3         lifecycle_1.0.3   vctrs_0.6.3      
[33] evaluate_0.21     glue_1.6.2        codetools_0.2-19  fansi_1.0.4      
[37] colorspace_2.1-0  rmarkdown_2.23    httr_1.4.6        tools_4.3.1      
[41] pkgconfig_2.0.3   htmltools_0.5.5